AdaLiftOver: high-resolution identification of orthologous regulatory elements with Adaptive liftOver

Abstract Motivation Elucidating functionally similar orthologous regulatory regions for human and model organism genomes is critical for exploiting model organism research and advancing our understanding of results from genome-wide association studies (GWAS). Sequence conservation is the de facto approach for finding orthologous non-coding regions between human and model organism genomes. However, existing methods for mapping non-coding genomic regions across species are challenged by the multi-mapping, low precision, and low mapping rate issues. Results We develop Adaptive liftOver (AdaLiftOver), a large-scale computational tool for identifying functionally similar orthologous non-coding regions across species. AdaLiftOver builds on the UCSC liftOver framework to extend the query regions and prioritizes the resulting candidate target regions based on the conservation of the epigenomic and the sequence grammar features. Evaluations of AdaLiftOver with multiple case studies, spanning both genomic intervals from epigenome datasets across a wide range of model organisms and GWAS SNPs, yield AdaLiftOver as a versatile method for deriving hard-to-obtain human epigenome datasets as well as reliably identifying orthologous loci for GWAS SNPs. Availability and implementation The R package and the data for AdaLiftOver is available from https://github.com/keleslab/AdaLiftOver.


The orthologous gene promoters
We downloaded NCBI RefSeq gene annotation files from https://hgdownload.soe.ucsc. edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz and https://hgdownload. soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz. We utilized the human-mouse homology table from the Jackson Laboratory http://www.informatics.jax. org/downloads/reports/HOM_MouseHumanSequence.rpt and identified 16,374 orthologous genes in total. We defined the promoter region of a gene as 2 kb upstream and 200 bp downstream of the transcription start site (TSS). We explored two sets of chain files: the reciprocal chain files ("hg38.mm10.rbest.chain", "mm10.hg38.rbest.chain") and the UCSC liftOver tool default chain files ("hg38ToMm10.over.chain.gz","mm10ToHg38.over.chain.gz") to determine the mappings between the orthologous promoters. We utilized both the UCSC liftOver command line tool by varying its various tuning parameters and its R implementation "rtracklayer::liftOver()" (a slim version without tuning parameters) to map all human/mouse promoters to the mouse/human genome and overlapped them with the corresponding counterparts. When a human/mouse promoter failed to map, we extended the promoters by 2 kb on both sides and remapped the extended region. In this context, we defined the positive and negative mappings as follows: • True positive (TP): a promoter maps and overlaps with the correct counterpart.
• False positive (FP): a promoter maps but does not overlap with the correct counterpart.
• True negative (TN): a promoter fails to map or overlap with the correct counterpart after extending by 2 kb on both sides.
• False negative (FN): a promoter fails to map but it maps to the correct counterpart after extending by 2 kb on both sides.
The results are summarized in Supplementary Table 1.

The ENCODE datasets in the AdaLiftOver default repertoire for computing epigenomic feature similarities
The complete list of the datasets is provided in Supplementary Table 2.

Evaluating similarity metrics with the ENCODE cCRE data
The orthologous ENCODE cCRE data (v3) from both hg38 and mm10 were downloaded from https://screen.encodeproject.org/. We utilized the R function "rtracklayer::liftOver()" with the UCSC chain file "hg38.mm10.rbest.chain" to map human cCREs to mouse. The 103,529 orthologues were determined by requiring an overlap of at least 50% of the bases between the human cCRE orthologues and mouse cCREs. The human cCRE categories before and after mapping are summarized in Supplementary Figure 1.
With the 103,529 orthologous cCREs, we further evaluated the combinations of three different similarity metrics (Pearson correlation, cosine similarity, and soft cosine similarity 1 ) and three different data formats summarizing the epigenomic signals in the form of peaks (BED format), fold change of normalized ChIP reads over control reads (Fold change BigWig), and p-values comparing ChIP and control reads along the genome (P-value BigWig). We specifically considered: • BED format: we overlapped each cCRE with the list of peaks and quantified the strength of the signal as the percentage overlap with the peaks (numerical value between 0 and 1).
• Fold change BigWig format: we computed the arithmetic average of BigWig signals for each cCRE.
• P-value BigWig format: we computed the geometric average of BigWig signals for each cCRE.
Supplementary Figure 2 illustrates that the soft cosine similarity underperforms compared to the other metrics, and the fold-change BigWig format is outperformed by the p-value BigWig format as benchmarked by the z-scores. We observe that the BED format and the p-value BigWig format yield the best performance when combined with the Pearson correlation and the cosine similarity, respectively. Although the p-value BigWig format (with the Pearson correlation metric) achieves higher average z-scores compared to the BED format (with the cosine similarity metric), it exhibits larger variability across all 6 cCRE categories (Supplementary Figure 2). Therefore, the numerical features derived from the BED format data are as robust and informative as the BigWig formatted data.
The binary features constructed by overlapping the cCREs with the peaks (BED format data) attain similar performance as the numerical features derived from the BED format data (with the cosine similarity metric) (Supplementary Figure 3). A comparison of Supplementary  Figures 2 and 3 indicates that the cosine similarity, which shows less variability than the Jaccard similarity, is a better similarity metric for binary vectors. We also incorporated the Jaccard similarity option in AdaLiftOver.

Discussion on the choice of data formats and similarity metrics
Epigenomic feature similarity. The 67 pairs of BigWig files from the ENCODE repertoire required ∼10 GB disk storage after being compressed from the ∼150 GB raw data to the Run-Length-Encoding structure separated by chromosomes. In contrast, only <50 MB space was required to store the binary signals of the BED files from the same ENCODE repertoire. Querying the BigWig format data, while it theoretically takes O(1) time complexity, was computational prohibitive for small servers and personal laptops. For a scientific computing server with Intel Xeon CPU E5-2680 v4 2.40GHz, it took ∼20 minutes to load the data in R and ∼24 hours to handle ∼100,000 orthologous genomic regions. However, querying the BED format data only took ∼1 minute for the same task. Further memory issues arose as the number of parallel jobs increase, rendering the overall approach using the BigWig numeric signals infeasible. On the contrary, the BED binary signals scaled well on a regular laptop. Therefore, AdaLiftOver leveraged binary features derived from the BED files with the cosine similarity metric for optimizing the disk storage, the computational speed, the scalability, and the performance robustness ( Supplementary Figures 2 and 3).
Sequence grammar feature similarity. The pre-computed occurrence data for the 841 JAS-PAR motifs in the human and mouse genomes occupied ∼200 GB disk storage. Due to the high space requirement and to enable applicability to any model organism genome, we chose the motifmatchr R package, accelerated by C++ libraries, to perform on-demand motif scanning using PFM/PWMs. After quantifying motif occurrences, we used binary features and cosine similarity metric for the sequence grammar.

Parameter tuning for AdaLiftOver
Leave-one-out cross validation (LOOCV) with the ENCODE datasets. For each of the 67 EN-CODE epigenome datasets in Supplementary Table 2, we utilized the remaining ENCODE datasets excluding those from the same tissue as the training data to compute epigenomic feature similarities. For a given set of human query regions (e.g., one of the human peak sets from Supplementary Table 2), we first identified candidate target regions in the mouse by lifting the query regions over to the mouse genome based on sequence conservation (R liftOver). Here, if a query region does not map, it gets extended on both sides and gets lifted over again. Then, these target regions in mouse are overlapped with the matched epigenomic data in the mouse and labeled as orthologous human-mouse pairs if they reside within a peak (Fig. 3c). This generates positive pairs: query region maps to mouse and the resulting target region overlaps with a peak in the matched peak set and negative pairs: query region maps to mouse but the resulting target region does not overlap with a peak in the matched peak set. These sets of positive and negative pairs are utilized for true/false positive calls. Supplementary Figure 4 illustrates that the optimal window size lies between 1 kb and 3 kb when evaluated across a a grid of window sizes by considering the trade-off between AUPR and AUROC. In practice, we suggest using a window size within this 1 to 3 kb range.
At window size of 2 kb, we summarize below the estimated logistic regression coefficients (mean ± s.e.;β 0 ,β 1 ,β 2 denote the estimated coefficients for the intercept, epigenomic feature similarity, and sequence grammar feature similarity, respectively): We used this set of parameters as default for the rest of the manuscript for mapping between human and mouse. With the constraint of filtering at most 50% of candidate target regions, we identified logistic probability score thresholds 0.376±0.0374 that maximize the precision 0.554±0.0371 using the AdaLiftOver training module. We selected the score threshold = 0.4 for the case studies on ATAC-seq and ChIP-seq peaks. The effects of selecting a logistic probability score threshold can be visually investigated with ROC/PR curves. The users may increase/decrease the score threshold if more stringent/lenient results are needed. We also considered an interaction term between the epigenomic feature similarity and the sequence grammar feature similarity in the logistic regression. However, the area under the curves did not yield significant differences with or without an interaction term (Mann-Whitney test p-values: 0.9752 for AUROC, 0.9539 for AUPR). AdaLiftOver implementation allows the inclusion of such an interaction term which might impact the results in other datasets.
The results are summarized in in Supplementary Table 4.
We then downloaded the following ENCODE ATAC-seq/DNase-seq datasets to evaluate the performances of AdaLiftOver with matched tissues/cell types: • mouse-CH12, DNase-seq, ENCFF855RCO.bed • human-GM12878, ATAC-seq, ENCFF470YYO.bed • mouse-MEL, DNase-seq, ENCFF726EZS.bed • human-K562, ATAC-seq, ENCFF558BLC.bed When combining the above epigenome profiles with the default ENCODE epigenome repertoire, we set the weights (w) of the epigenomic features from the matched tissues by a factor of 10 more in the similarity calculations.
We used the corresponding rbest chain files for each of these species. We used a leaveone-out cross-validation strategy similar to that of Subsection 1.6 for the human-mouse experiments of matched epigenomic datasets, with a probability score threshold of 0.4. Furthermore, we also added new modules for the additional species to the AdaLiftOver R package vignette, where we similarly used averages of the logistic regression parameter estimates from folds of the cross-validation experiments.
To provide support for the mapped GWAS SNPs, we asked whether ATAC-seq peaks of relevant cell types overlapped with the mapped GWAS SNPs more than expected by chance. Our calculations took into account background genomic characteristics such as sequence conservation as quantified by the PhyloP scores 2 and genomic coordinates as quantified by the chromosomes that the ATAC-seq peaks reside in. We first binned all the open chromatin regions 3 of the genome into intervals of 100 bp and stratified these with respect to their PhyloP scores discretized at 0.1 resolution for each chromosome separately. Then, we computed the empirical probability of overlap (p i ) with the mapped GWAS SNPs within each strata S i , i = 1 · · · , I, where index i runs over the PhyloP score grid and the numbers of chromosomes. Next, for a given ATAC-seq peak set of size P , we first calculated the expected proportion of overlap with the GWAS SNPs as p 0 = 1 P p i I(p-th peak is in strata i)p i . Using this null proportion, we evaluated the enrichment with a Binomial test that assessed whether the observed overlap of the ATAC-seq peak set with the mapped GWAS SNPs is larger than that can be explained by this expected proportion of overlap.

Hematopoiesis GWAS SNPs
We leveraged 10 blood-related mouse ATAC-seq peaks (Xiang et al., 2020) and fine-mapped hematopoietic GWAS SNPs (Ulirsch et al., 2019) with PIP ≥ 0.1 from 4 blood traits: MCV, MPV, Mono, Lymph. The settings of the methods compared and the enrichment analysis were the same as the SCZ GWAS case. The benchmarking results are summarized in Supplementary  Table 14. The enrichment results for Mono (monocyte count) and MPV (mean platelet volume) are displayed in Supplementary Figure 10.

Bone Mineral Density GWAS SNPs
The promoter regions of BMD genes were defined as 2 kb upstream and 200 bp downstream of TSS. We utilized 1,097 fine-mapped BMD GWAS SNPs from Morris et al., 2019 as well as 2,088 estimated BMD fine-mapped GWAS SNPs from the UK Biobank (SuSIE PIP ≥ 0.1). For negative controls, we utilized a comparable number of 3,601 BMI SNPs from the UK Biobank (SuSIE PIP ≥ 0.05). The settings of the methods compared and the enrichment analysis were the same as the SCZ GWAS case. The regions for enrichment analysis were the 52 BMD gene promoters and the constructed universe consisted of all orthologous gene promoters identified in Section 1.1. The results are summarized in Supplementary Table 15. We further used 116,402 fine-mapped GWAS SNPs from the UK Biobank (SuSIE PIP ≥ 0.001). We didn't run EpiAlignment due to the lack of scalability of this method. The results are summarized in Supplementary Table 16.   (Lu et al., 2019) 0.23 0. 027  --------------∼3 hours -(Run on WebApp) UCSC liftOver (Hinrichs et al., 2006) Figure 2: Comparison of epigenome similarities for different classes of orthogolous cCREs with three different metrics (cor: Pearson's correlation; cosine: cosine similarity; soft cosine: soft cosine similarity) and three different data formats (bed: BED format; bw fc: fold change BigWig format; bw p: p-value BigWig format) for constructing the epigenomic features. A null distribution for each of the similarity scores is estimated by randomly permuting cCREs 10,000 times. Observed similarity scores were transformed into z-scores using the mean and variance estimates of these null distributions.